Investigate Red Wine Quality by Felix Lou

About

In this project, I am going to explore a dataset that contains information on red wine quality and the chemical properties associated with it. With the help of the statistical program R, I will first conduct a preliminary investigation to see if there are any relationships among the variables, and then illustrate them with plots. The dataset is available for download here, and its documentation is available here.

Preliminary Investigation

Let’s run some basic functions to get a first look at the dataset.

# Check the general structure of the dataset
str(df)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
# Double-check whether there are any duplicates
anyDuplicated(df)
## [1] 0
# A glimpse of the statistical summary of each variable
summary(df)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Some findings at first glance:

  • There are altogether 1599 observations of 13 numeric variables.
  • X seems to be the ID of each wine
  • Quality-wise, most wines cluster around the middle of the 0-10 score scale, with a max of 8, min of 3, median of 6, and mean of 5.636. quality is stored as an integer, but it is better treated as a categorical (ordinal) variable
  • According to the documentation, some variables, like the acids, may be correlated with each other. This implies there may be multicollinearity
  • Apart from quality and the two sulfur dioxide variables (free.sulfur.dioxide and total.sulfur.dioxide), all other variables are continuous

A Glimpse of the Distribution

Let’s make a set of histograms for all the variables to get a general idea of their distributions first.
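The plotting code is not shown here, so the following is only a minimal sketch of how such a set of histograms could be drawn with base R. It uses three columns from the first ten rows printed by str(df) above as a stand-in (wine_sample is a name made up for this example); with the full data, the same loop would run over the columns of df.

```r
# Stand-in data: three columns from the first ten rows shown in str(df).
wine_sample <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Draw one histogram per column, side by side.
old_par <- par(mfrow = c(1, ncol(wine_sample)))
hists <- lapply(names(wine_sample), function(v) {
  hist(wine_sample[[v]], main = v, xlab = v)
})
par(old_par)  # restore the previous plot layout
```

hist() also returns the bin counts invisibly, so hists can be inspected if the exact frequencies are needed.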

Univariate Analysis

Quality of Wine

Looking at the histogram of wine quality, we can see that most wines receive roughly average ratings on a discrete scale, and there are no real extreme cases (outliers). Although it is not obvious, the distribution is roughly bell-shaped.

Based on this distribution, let’s create another variable based on the ratings. Three categories, ‘good’, ‘average’, and ‘bad’, will represent wines that receive a rating of 7 or above, 5 or 6, and below 5, respectively.

## 
##     bad average    good 
##      63    1319     217
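One way to build such a grouping is with cut(). The sketch below is not necessarily the code that produced the table above; it only shows the binning rule on a stand-in vector that covers every score seen in the data (3 to 8). With the full data, the same call would be applied to df$quality.

```r
# Stand-in vector: one wine for each quality score that occurs in the data.
quality <- 3:8

# cut() uses right-closed intervals by default:
#   (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> good
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

table(rating)
```

Making the factor ordered keeps ‘bad’ < ‘average’ < ‘good’ in tables and on plot axes.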

Acidity of Wine

Acids are major wine constituents and contribute greatly to its taste. According to the documentation and some online research, fixed.acidity and volatile.acidity are two different types of acids (tartaric and acetic); so let’s create another variable that stands for the overall acidity of a wine.
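The definition of overall.acidity is not shown in this write-up. As a sketch, and purely as an assumption, it is taken below to be the sum of the three acid columns (fixed.acidity, volatile.acidity, and citric.acid). The values are the first five rows printed by str(df); first_rows is a made-up name.

```r
# First five rows of the three acid columns, as printed by str(df).
first_rows <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0, 0, 0.04, 0.56, 0)
)

# ASSUMPTION: overall acidity is the plain sum of the three acid measurements.
first_rows$overall.acidity <- with(first_rows,
  fixed.acidity + volatile.acidity + citric.acid)

first_rows
```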

High Level Comparison on Good and Bad

Although the histograms above tell us the general properties of the wines, they cannot really tell us what separates the good ones from the bad ones.

We need to perform a comparison in order to tell the difference. Let’s first compare the overall acidity between the good ones and bad ones.

I believe there are reasons why wine experts divide acidity into two groups; so let’s dig deeper by plotting the acids individually along with other variables.

Here, we plot the probability density functions of the good ones and the bad ones. Generally speaking, we want to focus on the areas where the two groups do not overlap, because those can serve as a reference for what separates the good ones from the bad ones. In other words, variables for which the two groups show an obvious difference in distribution could be indicators of wine quality, or at least help us predict it.
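To make the idea concrete, here is a minimal base R sketch of overlaying two density estimates. The numbers are simulated only for illustration and are not taken from the wine data; with the real data, good and bad would be one variable (for example alcohol) split by rating.

```r
set.seed(42)

# SIMULATED stand-in values, not the real wine measurements.
bad  <- rnorm(60,  mean = 10,   sd = 1)
good <- rnorm(200, mean = 11.5, sd = 1)

# Kernel density estimate for each group.
d_bad  <- density(bad)
d_good <- density(good)

# Overlay both curves on shared axes so the non-overlapping regions stand out.
plot(d_good, main = "Density by rating (simulated)", xlab = "value",
     xlim = range(d_bad$x, d_good$x),
     ylim = c(0, max(d_bad$y, d_good$y)))
lines(d_bad, lty = 2)
legend("topright", legend = c("good", "bad"), lty = c(1, 2))
```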

Looking at the above plots, we can see relatively obvious differences in volatile.acidity, citric.acid, pH, sulphates, and alcohol. And if we look at the plots individually, we can see that the peaks of the two groups are especially far apart for citric.acid and alcohol (each on its own scale). This suggests that these two variables might serve as indicators of wine quality.

Outliers

As we saw from the plots above, the distributions of some variables are skewed. This suggests that there might be outliers in the dataset. Let’s validate this with the help of boxplots.

Now we have a much clearer picture of the outliers.
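A boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. As a small sketch, the same rule can be applied numerically with boxplot.stats(), here on the ten residual.sugar values printed by str(df) above (a stand-in for the full column).

```r
# residual.sugar for the first ten wines, as printed by str(df).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# boxplot.stats() applies the 1.5 * IQR rule used by boxplot().
stats <- boxplot.stats(residual.sugar)
outliers <- stats$out
outliers  # 6.1

# The matching picture: the flagged value is drawn as a separate point.
boxplot(residual.sugar, horizontal = TRUE, xlab = "residual.sugar")
```

In this small sample the hinges are 1.8 and 2.3, so anything above 2.3 + 1.5 * 0.5 = 3.05 is flagged, which is why only 6.1 is reported.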

Bivariate Analysis

From the density plots, we can see that some variables may have an impact on wine quality. Let’s see if we can capture the trend with the help of boxplots.

Now the relationships between certain variables and wine quality are much clearer. In general, the more the boxes shift from one quality level to the next, the greater the apparent impact of that variable on wine quality. Here, fixed.acidity, volatile.acidity, pH, citric.acid, sulphates, and alcohol all seem to have an impact on wine quality. This is especially true for alcohol, sulphates, volatile.acidity, and citric.acid. Let’s double-check by calculating the correlation coefficient of each variable against quality.

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
## overall.acidity       0.10375373
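A one-column table like the one above is what cor() returns when a set of columns is correlated against a single vector. The following is a hedged sketch of that kind of call, not necessarily the original code, run on three predictors from the first ten rows printed by str(df) (wine_sample is a made-up name). With only ten rows the coefficients will differ from the full-data values above.

```r
# Stand-in data: first ten rows of four columns, as printed by str(df).
wine_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlate every predictor column with quality (Pearson by default).
predictors <- setdiff(names(wine_sample), "quality")
cors <- cor(wine_sample[, predictors], wine_sample$quality)
cors
```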

To summarize:

One thing that interests me is that although good wines generally have relatively high acidity, they also tend to have lower levels of volatile.acidity. As mentioned in the documentation, volatile.acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. This suggests that volatile.acidity may be negatively correlated with the other acids.

Let’s plot volatile.acidity against citric.acid and fixed.acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  df$volatile.acidity and df$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

## 
##  Pearson's product-moment correlation
## 
## data:  df$volatile.acidity and df$fixed.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309

As expected, volatile.acidity is negatively correlated with other acids. This is especially true for citric.acid, given the correlation coefficient is -0.552.
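The test outputs above have the shape produced by cor.test(). As a minimal sketch, here is the same kind of call on the first ten rows printed by str(df); the estimate and interval from ten rows will not match the full-data figures, and the interval is much wider.

```r
# First ten rows of the two columns, as printed by str(df).
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
citric.acid      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Pearson's product-moment correlation test (the default method).
result <- cor.test(volatile.acidity, citric.acid)

result$estimate   # sample correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value for H0: true correlation equals 0
```

The degrees of freedom are n - 2, which is why the full-data outputs above report df = 1597 for 1599 wines.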

According to the documentation, total.sulfur.dioxide includes free.sulfur.dioxide. This implies that there should be a correlation between the two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  df$free.sulfur.dioxide and df$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

As expected, there is a fairly strong positive correlation of 0.668 between the two variables.

According to the documentation, density is a variable that depends on alcohol level and sugar content. This suggests that density may be correlated with both residual.sugar and alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  df$residual.sugar and df$density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3116908 0.3973835
## sample estimates:
##       cor 
## 0.3552834

## 
##  Pearson's product-moment correlation
## 
## data:  df$alcohol and df$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

The results are as expected. residual.sugar is positively correlated with density, and alcohol is negatively correlated with density.

pH should be hugely affected by a wine’s acidity, as pH is essentially a measure of acidity; so let’s plot overall.acidity against pH.

## 
##  Pearson's product-moment correlation
## 
## data:  df$overall.acidity and df$pH
## t = -37.418, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7087579 -0.6564574
## sample estimates:
##        cor 
## -0.6834838

The results are as expected. overall.acidity is negatively correlated with pH.

Multivariate Analysis

The last plot seems to be the most linear of all. Such a strong correlation suggests that we might use a linear model to predict pH based on overall.acidity.

According to the boxplot, the residuals are generally consistent across qualities, except for the wines with the lowest quality. The ones with quality 3 have a median residual well below 0. Supposedly, acidity itself should be the biggest factor that alters pH. Nonetheless, this does not really apply to this group of wines. Chances are there are omitted variables.

Previously, we spotted a number of correlations among variables. Now let’s visualize them by rating and by quality, and see if the pattern is consistent across groups.
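The model and residual boxplot could be produced along the following lines. This is a sketch under two assumptions: overall.acidity is taken to be the sum of the three acid columns (its definition is not shown in this write-up), and the first ten rows printed by str(df) stand in for the full data (wine_sample is a made-up name). Those ten rows contain no wine of quality 3, so the pattern described above will not show up in this small sample.

```r
# Stand-in data: first ten rows, as printed by str(df).
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# ASSUMPTION: overall acidity is the sum of the three acid measurements.
wine_sample$overall.acidity <- with(wine_sample,
  fixed.acidity + volatile.acidity + citric.acid)

# Simple linear regression of pH on overall acidity.
fit <- lm(pH ~ overall.acidity, data = wine_sample)
wine_sample$residual <- resid(fit)

# Residuals grouped by quality; a group sitting away from the zero line
# is one the model systematically over- or under-predicts.
boxplot(residual ~ factor(quality), data = wine_sample,
        xlab = "quality", ylab = "residual (observed pH - fitted pH)")
abline(h = 0, lty = 2)

# Median residual per quality level.
tapply(wine_sample$residual, wine_sample$quality, median)
```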

The patterns are consistent across groups.

Previously, when doing the univariate analysis, we found that the peaks of the good and bad groups are especially far apart for citric.acid and alcohol (each on its own scale). We also found that quality is relatively strongly correlated with both volatile.acidity and sulphates. Let’s see how the good ones and bad ones are distributed in a scatter plot.

By plotting them in scatter plots, our ideas are further validated. Although the plots do not allow a crystal-clear generalization, they give us a pretty clear picture of what separates good wines from bad wines.


Final Plots and Summary

Plot One - Probability Density Function on Citric Acid and Alcohol

This first pair of plots shows how the wine groups are distributed over each variable. A variable is a fairly good indicator when the distributions of the groups do not overlap and sit farther apart from each other.

Plot Two - Volatile Acidity & Sulphates Level against Quality

This second pair of plots shows how lower volatile.acidity and higher sulphates are associated with higher wine quality. Although they cannot tell the whole story, the plots are good enough for us to say that these two variables should not be neglected.

Plot Three - Good & Bad Wines Generalization

This third pair of plots further demonstrates the idea of the first pair of plots. Here, we can clearly see where certain wines are clustered. We can be even more confident in saying that these variables are some of the key factors to consider when evaluating wine.


Reflection

In this EDA, I was able to identify some of the key factors that have an impact on wine quality. Although wine quality is arguably subjective, the results we got from this analysis are reasonable. At the very least, we can say that the variables we investigated do play a role in wine quality according to conventional industry standards. Generally speaking, acidity, sulphates level, and alcohol level are the ones most strongly associated with wine quality. There could be more probabilistic work if we want to run hypothesis tests on certain statistical summaries, like the difference in mean alcohol among groups.